resulted in a nonsynonymous codon that is translated into an amino acid with different
physicochemical properties. For instance, if a hydrophobic amino acid is replaced by
another hydrophobic amino acid, SIFT will predict that the change is tolerated; however, if it is
substituted with a polar amino acid, the variant will be predicted as deleterious. The SIFT
algorithm relies on NCBI PSI-BLAST, using the translated protein as a query sequence
against a database of protein sequences. The hit sequences are aligned into a multiple
sequence alignment (MSA), and the probabilities of all possible substitutions at each
position are computed to form a position-specific scoring matrix (PSSM), in which each entry
represents the probability of observing an amino acid in that column of the
alignment. The probabilities are normalized by the probability of the consensus (most
frequent) amino acid, so the normalized probability at each position ranges between 0 and 1.
SIFT predicts that an SNV with a normalized probability between 0.0 and 0.05 at that position
is deleterious and will affect the function of the protein, whereas a variant with a
probability greater than 0.05 (>0.05) is predicted to be tolerated. SIFT
also measures conservation of the sequence using the median sequence conservation,
which ranges from 0 to log2(20), that is, from 0 to about 4.32. A median sequence conservation
of 4.32 indicates that all sequences in the alignment are identical to each other, and hence
any variant in this region will be predicted as damaging. SIFT also reports the number of
sequences at the variant position. The latest version of SIFT is SIFT 4G (SIFT for genomes),
which is faster, makes genome-wide computation practical by using precomputed
databases, and provides SIFT predictions for more organisms. Precomputed
databases are available for hundreds of organisms.
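The scoring rule described above can be sketched in a few lines of shell and awk. This is a simplified illustration, not SIFT's actual implementation: it assumes raw amino acid counts for a single alignment column and leaves out the pseudocounts and sequence weighting a real tool would apply. The function names sift_like_scores and max_conservation and the example counts are made up for this sketch.

```shell
#!/bin/sh
# Simplified sketch of the SIFT decision rule for ONE alignment column.
# Input lines: "<amino_acid> <count>" (hypothetical counts, not real data).
# Each count becomes a probability, is divided by the probability of the
# most frequent (consensus) residue, and is compared with the 0.05 cutoff.
sift_like_scores() {
    LC_ALL=C awk '
        BEGIN { max = 0 }
        { aa[NR] = $1; n[NR] = $2; total += $2; if ($2 > max) max = $2 }
        END {
            for (i = 1; i <= NR; i++) {
                # observed probability normalized by the consensus probability
                score = (n[i] / total) / (max / total)
                verdict = (score <= 0.05) ? "deleterious" : "tolerated"
                printf "%s %.3f %s\n", aa[i], score, verdict
            }
        }'
}

# Upper bound of the median sequence conservation: log2(20)
max_conservation() {
    LC_ALL=C awk 'BEGIN { printf "%.2f\n", log(20) / log(2) }'
}

# Hypothetical column: mostly hydrophobic residues and one rare aspartate
printf 'L 60\nI 30\nV 9\nD 1\n' | sift_like_scores
max_conservation
```

With these made-up counts, the hydrophobic residues L, I, and V score 1.000, 0.500, and 0.150 and are called tolerated, while the polar residue D scores 0.017 and is called deleterious. The last line printed is 4.32, the conservation value of a fully identical alignment.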
Use the following steps to annotate variants using SIFT 4G on a Linux terminal:
First, create a directory with a name of your choice, such as “sift4g”, and change into it.
mkdir sift4g
cd sift4g
Open “https://sift.bii.a-star.edu.sg/sift4g/public/”. You will see databases for tens of
organisms. Scroll down to the database of interest, open its folder, and download the
appropriate database build into your working directory. Since the variants called above
come from human samples, we can download the latest human build, GRCh38.78, by copying its
link and passing it to the “wget” command, and then decompress it using “unzip” or follow the instructions.
wget https://sift.bii.a-star.edu.sg/sift4g/public/Homo_sapiens/GRCh38.78.zip
unzip GRCh38.78.zip
Each chromosome will have three files: a compressed file with the “.gz” file extension, a regions
file with the “.regions” file extension, and a chromosome statistics file with the “.txt” file extension.
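After unzipping, it can help to confirm that the three kinds of per-chromosome files are all present. The following is a rough sketch only: the function name check_sift_db is made up, the directory name GRCh38.78 is an assumption about where unzip places the files, and the region-file pattern is kept deliberately loose. Any extra “.txt” file in the folder (a readme, for example) would throw the counts off, so treat a mismatch as a prompt to look at the folder rather than as proof of a broken download.

```shell
#!/bin/sh
# Count the three kinds of per-chromosome files in a SIFT 4G database folder
# and succeed only if there is at least one of each and the counts agree.
check_sift_db() {
    db_dir=$1
    # tr strips the padding that some wc implementations add
    n_gz=$(find "$db_dir" -maxdepth 1 -type f -name '*.gz' | wc -l | tr -d ' ')
    n_region=$(find "$db_dir" -maxdepth 1 -type f -name '*.region*' | wc -l | tr -d ' ')
    n_txt=$(find "$db_dir" -maxdepth 1 -type f -name '*.txt' | wc -l | tr -d ' ')
    echo "gz=$n_gz region=$n_region txt=$n_txt"
    [ "$n_gz" -gt 0 ] && [ "$n_gz" -eq "$n_region" ] && [ "$n_gz" -eq "$n_txt" ]
}

# Assumed folder name; change it to the folder that unzip actually created
if [ -d GRCh38.78 ]; then
    check_sift_db GRCh38.78 && echo "database folder looks complete"
fi
```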
Download the SIFT 4G Annotator Java executable file (.jar) into a directory of your choice or
into your working directory:
wget https://github.com/paulineng/SIFT4G_Annotator/raw/master/SIFT4G_Annotator.jar
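Because the annotator is a .jar file, it needs a Java runtime on the machine. As a small pre-flight check, you could confirm that the tools used so far are on the PATH and that the .jar file really arrived. This is a sketch; the helper name have_cmd is made up for this example, and the check assumes the .jar was saved in the current directory.

```shell
#!/bin/sh
# Return success if the named command can be found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

for tool in java wget unzip; do
    if have_cmd "$tool"; then
        echo "found: $tool"
    else
        echo "missing: $tool"
    fi
done

# -s is true only when the file exists and is not empty
if [ -s SIFT4G_Annotator.jar ]; then
    echo "SIFT4G_Annotator.jar is present"
else
    echo "SIFT4G_Annotator.jar not found in $(pwd)"
fi
```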